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Abstract: 

Crop models of crop growth are increasingly used to quantify the impact of 
global changes due to climate or crop management. Therefore, accuracy of 
simulation results is a major concern. Studies with ensembles of crop 
models can give valuable information about model accuracy and 
uncertainty, but such studies are difficult to organize and have only 
recently begun. We report on the largest ensemble study to date, of 27 
wheat models tested in four contrasting locations for their accuracy in 
simulating multiple crop growth and yield variables. The relative error 
averaged over models was 24-38% for the different end-of-season 
variables including grain yield (GY) and grain protein concentration (GPC). 
There was little relation between error of a model for GY or GPC and error 
for in-season variables. Thus, most models did not arrive at accurate 
simulations of GY and GPC by accurately simulating preceding growth 
dynamics. Ensemble simulations, taking either the mean (e-mean) or 
median (e-median) of simulated values, gave better estimates than any 
individual model when all variables were considered. Compared to 
individual models, e-median ranked first in simulating measured GY and 
third in GPC. The error of e-mean and e-median declined with an 
increasing number of ensemble members, with little decrease beyond 10 
models. We conclude that multimodel ensembles can be used to create 
new estimators with improved accuracy and consistency in simulating 
growth dynamics. We argue that these results are applicable to other crop 
species, and hypothesize that they apply more generally to ecological 
system models. 
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Abstract 

Crop models of crop growth are increasingly used to quantify the impact of global changes 
due to climate or crop management. Therefore, accuracy of simulation results is a major 
concern. Studies with ensembles of crop models can give valuable information about model 
accuracy and uncertainty, but such studies are difficult to organize and have only recently 
begun. We report on the largest ensemble study to date, of 27 wheat models tested in four 
contrasting locations for their accuracy in simulating multiple crop growth and yield 
variables. The relative error averaged over models was 24-38% for the different end-of-season 
variables including grain yield (GY) and grain protein concentration (GPC). There was little 
relation between error of a model for GY or GPC and error for in-season variables. Thus, 
most models did not arrive at accurate simulations of GY and GPC by accurately simulating 
preceding growth dynamics. Ensemble simulations, taking either the mean (e-mean) or 
median (e-median) of simulated values, gave better estimates than any individual model when 
all variables were considered. Compared to individual models, e-median ranked first in 
simulating measured GY and third in GPC. The error of e-mean and e-median declined with 
an increasing number of ensemble members, with little decrease beyond 10 models. We 
conclude that multimodel ensembles can be used to create new estimators with improved 
accuracy and consistency in simulating growth dynamics. We argue that these results are 
applicable to other crop species, and hypothesize that they apply more generally to ecological 
system models. 
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Introduction 

Global change with increased climatic variability are projected to strongly impact crop and 
food production, but the magnitude and trajectory of these impacts remain uncertain (Tubiello 
et al, 2007). This uncertainty, together with the increasing demand for food of a growing 
world population (Bloom, 2011), has raised concerns about food security and the need to 
develop more sustainable agricultural practices (Godfray et al., 2010). More confident 
understanding of global change impacts is needed to develop effective adaptation and 
mitigation strategies (Easterling et al., 2007). Methodologies to quantify global change 
impacts on crop production include statistical models (Lobell et al., 2011) and process-based 
crop simulation models (Porter & Semenov, 2005), which are increasingly used in basic and 
applied research and to support decision making at different scales (Angulo et al., 2013, 
Challinorei al, 2009, Ko et al., 2010, Rosenzweig et al, 2013b). 

Different crop growth and development processes are affected by climatic variability via 
linear or non-linear relationships resulting in complex and unexpected responses (Trewavas, 
2006). It has been argued that such responses can best be captured by process-based crop 
simulation models that quantitatively represent the interaction and feedback responses of 
crops to their environments (Bertin et al., 2010, Porter & Semenov, 2005). Wheat is the most 
important staple crop in the world providing over 20% of the calories and proteins in human 
diet (FAOSTAT, 2012). It has therefore received much attention from the crop modeling 
community and over 40 wheat crop models are in use (White et al., 201 1). These differ in the 
processes included in the models and the mechanistic detail used to model individual 
processes like evapotranspiration or photosynthesis. Therefore, a thorough comparative 
evaluation of models is essential to understand the reliability of model simulations and to 
quantify and reduce the uncertainty of such simulations (Rotter et al., 2011). 
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The Wheat Pilot study (Asseng et al, 2013) of the Agricultural Model Intercomparison 
and Improvement Project (AgMIP; Rosenzweig el al, 2013b) compared twenty-seven wheat 
models, the largest ensemble of crop models created to date. The models vary greatly in their 
complexity and in the modeling approaches and equations used to represent the major 
physiological processes that determine crop growth and development and their responses to 
environmental factors, see Table S3 in supplemental in Asseng et al. (2013). 

An initial study (Asseng et al., 2013) analyzed the variability between crop models in 
simulating grain yield (GY) under climate change situations without specifically investigating 
multimodel ensemble estimators considering other end-of-season and in-season variables to 
better justify their possible application. The present analysis uses the resulting dataset to study 
how the multimodel ensemble average or median can reproduce in-season and end-of-season 
observations. In its simplest and most common form, a multimodel ensemble simulation is 
produced by averaging the simulations of member models weighted equally (Knutti, 2010). 
This method has been practiced in climate forecasting (Hagedorn et al, 2005, Raisanen & 
Palmer, 2001) and in ecological modeling of species distribution (Grenouillet et al, 2011), 
and it has been shown that multimodel ensembles can give better estimates than any 
individual model. Such improvement in skill of a multimodel ensemble may be also 
applicable to crop models. Preliminary evidence suggests that the average of ensembles of 
simulations is a good estimator of GY for several crops (Bassu et al., 2014, Palosuo et al., 
2011, Rotter et al, 2012) and possibly even better than the best individual model across 
different seasons and sites (Rotter et al., 2012). However, a detailed quantitative analysis of 
the quality of simulators based on crop model ensembles, compared to individual models is 
lacking. By looking at outputs of multiple growth variables (both in-season and end-of- 
season), we would get a broader picture of how ensemble estimators perform and a better 
understanding of why they perform well compared to individual models. It is important 
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therefore to consider not only GY but also other growth variables. If multimodel ensembles 
are truly more skillful than the best model in the ensemble, or even simply better than the 
average of the models, then using ensemble medians or means may be a powerful estimator to 
evaluating crop response to crop management and environmental factors. 

Model evaluations can give quite different results depending on the use of the model that is 
studied. Here we investigate the situation where models are applied in environments for 
which they have not been specifically calibrated, which is typically the situation in global 
impact studies (Rosenzweig et al., 2013a). The model results were compared to measured 
data from four contrasting growing environments. The modeling groups were provided with 
weather data, soil characteristics, soil initial conditions, management and flowering and 
harvest dates for each site. Although only four locations were tested in the AgMIP Wheat 
Pilot study, this limitation is partially compensated for by the diversity of the sites ranging 
from high to low yielding, from short to long season, and irrigated and not irrigated situations. 

Two main approaches to evaluate the accuracy and uncertainty of the AgMIP wheat model 
ensemble were followed. First we evaluated the range of errors and the average error of the 
models for multiple growth variables, including both in-season and end-of-season variables. 
Secondly, we evaluated two ensemble-based models, the mean (e-mean) and the median (e- 
median) of the simulated values of the ensemble members. Finally, we studied how the error 
of e-mean and e-median changed with the size of the ensemble. 
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Materials and Methods 

Experimental data 

Quality-assessed experimental data from single crops at four contrasting locations 
representing diverse agro-ecological conditions were used. The locations were Wageningen, 
The Netherlands (NL; Groot & Verberne, 1991), Balcarce, Argentina (AR; Travasso et al., 

2005) , New Delhi, India (IN; Naveen, 1986), and Wongan Hills, Australia (AU; Asseng et al., 
1998). Typical regional crop management was used at each site. In all experiments, the plots 
were kept weed-free, and plant protection methods were used as necessary to minimize 
damage from pests and diseases. Crop management and cultivar information, as given to each 
individual modeling group, are given in Table SI in supplemental. 

Daily values of solar radiation, maximum and minimum temperature and precipitation 
were recorded at weather stations at or near the experimental plots, except for IN solar 
radiation which was obtained from the NASA POWER dataset of modeled data (Stackhouse, 

2006) that extends back to 1983. Daily values of 2-meter wind speed (m s" 1 ), dew point 
temperature (°C), vapor pressure (hPa), and relative humidity (%) were estimated for each 
location from the NASA Modem Era Retrospective -Analysis for Research and Applications 
(Bosilovich et al., 2011), except for NL wind speed and vapor pressure that were measured on 
site. Air CO 2 concentration was taken to be 360 ppm at all sites. A weather summary for each 
site is shown in Fig. SI in supplemental. 

For all sites, end-of-season (i.e. ripeness-maturity) values for GY (t DM ha" 1 ), total 
aboveground biomass (AGBM m , t DM ha" 1 ), total aboveground nitrogen (N; AGN m , kg N ha" 
'), and grain N (GN m , kg N ha" 1 ) were available. From these values, biomass harvest index 
(HI = 100 x GY/AGBM m , %), N harvest index (NHI = 100 x GN m /AGN m , %), and grain 
protein concentration (GPC = 0.57 x GN m /GY, % of grain dry mass) were calculated. In- 
season measurements included leaf area index (LAI, m 2 m" 2 ; 15 measurements in total), total 
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1 

aboveground biomass (AGBM, t DM ha" ; 28 measurements), total aboveground N (AGN, kg 

2 

N ha" 1 ; 27 measurements) and soil water content to maximum rooting depth (mm, 28 

3 

measurements). Plant-available soil water to maximum rooting depth (PASW, mm) was 

4 

calculated from the measured soil water content by layer ( 0 v , vol%), the estimated lower 

5 

limit of water extraction (LL, vol%) , and the thickness of the soil layers (d, m): 

6 

pasw=£4x(®,,-ll,) (d 

i=i 

7 

where k is the number of sampled soil layers. 

8 

Based on the critical N dilution curve of wheat (Justes et al., 1994), a N nutrition index 

9 

(NNI, dimensionless) was calculated to quantify crop N status. Although this curve is 

10 

empirical, it is based on solid theoretical grounds (Lemaire & Gastal, 1997). Climatic 

11 

conditions can affect growth and N uptake differently, but the NNI reflects these effects in 

12 

terms of crop N needs (Gonzalez-Dugo et al., 2010, Lemaire et al., 2008). For a given 

13 

AGBM, NNI was calculated as the ratio between the actual and critical ( N c ; g N g" 1 DM) 

14 

AGN concentrations defined by the critical N dilution curve (Justes et al., 1994): 

15 

N c = 5.35 x AGBM- 0 ' 442 (2) 

16 

If the NNI value is close to 1 it indicates an optimal crop N status, a value lower than 1 

17 

indicates N deficiency and a value higher than 1 indicates N excess. 

18 

Models and setup of model intercomparison 

19 

The models considered here were the 27 wheat crop models (Table S2 in supplemental) used 

20 

in the AgMIP Wheat Pilot study (Asseng et al, 2013). All of these models have been 

21 

described in publications and are currently in use. Not all models simulated all measured 

22 

variables, either because the models did not simulate them or because they were not in the 

23 

standard outputs. Of the 27 models, 23 models simulated PASW values, and 20 simulated 
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AGN and GN, and therefore NNI and GPC could be calculated for these 20 models. NHI 
could be calculated for 19 models. 

All modeling groups were provided with daily weather data (i.e. precipitation, minimum 
and maximum air temperature, mean relative air humidity, dew point temperature, mean air 
vapor pressure, global radiation and mean wind speed), basic physical characteristics of soil, 
initial soil water and N content by layer and crop management information (Table S2 in 
supplemental). No indication of how to interpret or convert this information into parameter 
values was given to the modelers. Modelers were provided with observed anthesis and 
maturity dates for the cultivars grown at each site. Qualitative information on vernalization 
requirements and daylength responses were also provided. 

In the simulations, phenology parameters were adjusted to reproduce the observed anthesis 
and maturity dates, but otherwise models were not specifically adjusted to the growth data, 
which were only revealed to the modelers at the end of the simulation phase of the project. 
Modelers were instructed to keep all parameters except for genotypic coefficients constant 
across all four sites. 

For three of the four sites, the data used here were previously available in the literature, so 
some of these data may have been used in the past with some models as part of larger 
datasets. If so, this would concern only some of the data used here, only a few models and 
only part of the data used for testing and model calibration. We chose this over the alternative 
approach of only using unpublished data to avoid other potential problems (Kersebaum et al., 
2007, Palosuo et al., 2011, Rotter et al., 2012). 

Except for the four Expert-N models which were run by the same group, all models were 
run by different groups without communication between the groups regarding the 
parameterization of the initial conditions or cultivar specific parameters. In most cases, the 
model developers ran their own model. 
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1 

Model evaluation 

2 

Many different measures of the discrepancies between simulations and measurements have 

3 

been proposed (Wallach, 2006). We concentrated on the root mean squared error (RMSE) and 

4 

the root mean squared relative error (RMSRE), where each error is expressed as a percentage 

5 

of the observed value. The RMSE has the advantage of expressing error in the same units as 

6 

the variable. For comparing very different environments likely to give a broad range of crop 

7 

responses, the relative error may be more meaningful than the absolute error as it gives more 

8 

equal weight to each measurement. However, RMSRE needs to be interpreted with care 

9 

because it is very sensitive to errors when measured values are small, as occurred for several 

10 

early-season growth measurements. 

11 

RMSE was calculated as the square root of the mean squared error (MSE). MSE for 

12 

model m and for a particular variable (MSE,„) was calculated as: 

13 

(3) 

TV ;=i 

14 

where » is the value of the zth measurement of this variable, y m , is the corresponding value 

15 

simulated by model m, and N is the total number of measurements of this variable (i.e. the 

16 

sum over sites and over sampling dates per site for in-season variables). 

17 

RMSRE was calculated as: 

18 

RMSRE m =100x (4 ) 

y t J 

19 

To assess whether a model that simulates well for one variable also performs well for 

20 

other variables, Pearson’s product-moment correlation between the RMSE or RMSRE value 

21 

of each model was calculated across the variables. The adjusted two-sided P-values (^-values) 

22 

resulting from the correction for multiple tests were calculated and reported here. 
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Multimodel ensemble estimators 

We considered two models that are based on the ensemble of model simulations. The first 
ensemble estimator, e-mean, is the mean of the model simulations. The second ensemble 
estimator, e-median, is the median of the individual model simulations. For each of these 
ensemble models, e-mean and e-median, we calculated the same criteria as for the individual 
models, namely MSE, RMSE, and RMSRE. 

In order to explore how e-mean MSE and e-median MSE varied with the number of 
models in the ensemble, we performed a bootstrap calculation for each value of M ’ (number 
of models in the ensemble) from 1 to 27. For each ensemble size M’ we drew B = 25,600 
bootstrap samples of M’ models with replacement, so the same model might be represented 
more than once in the sample. A preliminary analysis showed that the results were essentially 
unchanged beyond 3,000 bootstrap samples. The final estimate of MSE for e-mean was then: 


1 i b n 2 

MSEe-msm “ Te-mean,/ ) (5) 

B A i=1 /=1 

where T*_ TOOT!>i ' is the e-mean estimate in bootstrap sample b of the zth measurements of this 
variable, given by: 


Tc. 


M’ 


M' 



m=\ 


For e-median the estimate of MSE was calculated as: 


( 6 ) 


MSE e-med iim = " ^median,; f (?) 

B h=l i=1 

In the case of e-mean, we can calculate the theoretical expectation of MSE analytically as 
a function of M\ Consider a variable at a particular site. Let ju* represent the true expectation 
of model simulations for that site (the mean over all possible models), and let \i iM , represent 
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an e-mean simulation which is based on a sample of models of size M’. The expectation of 
MSE (expectation over possible samples of M’ models) for e-mean is then: 


£(MSE m ,) = £ 


" 1 N 


i n 

-Ye 


( * * /v \2 


I N 

--Y 




+ 


var U) 

M 


(7) 


where var(j).) is the variance of the simulated values for the different models. The first term 

in the sum in (equation 8) is the squared bias of e-mean, when e-mean is based on a very large 
number of models. The second term is the variance of the model simulations divided by M. 
ju* can be estimated as the average of the simulations over all the models in our study, and 
var (j7) can be estimated as the variance of those model simulations. 

All calculations and graphs were made using the R statistical software R 3.0.1 (R Core 
Team, 2013). Pearson’s product-moment correlation P-values were adjusted for false 
discovery rate using the ‘LBE’ package (Dalmasso et al., 2005), and bootstrap sampling used 
the R function sampleQ. 


13 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 


Global Change Biology 


Page 16 of 49 


Results 

Evaluation of a population of wheat crop models 

In most cases, measured in-season LAI, PASW, AGBM, AGN, and NNI, and end-of-season 
GY and GPC values were within the range of model simulations (Fig. 1, 2). Even though 
measured GY ranged from 2.50 to 7.45 t DM ha" 1 across the four sites, the ranges of 
simulated GY values were similar at the four sites with an average range between minimum 
and maximum simulations of 1.64 t DM ha" 1 (Fig. 2a). The range between minimum and 
maximum simulations for GPC was also comparable at the four sites, averaging 7.1 
percentage points (Fig. 2b). 

On average over all models, the RMSRE was 29% (Fig. 3a and Table S3 in 
supplemental), and RMSE was 1.25 t DM ha" 1 for GY (Fig. 3b and Table 1 and Table S4 in 
supplemental). The uncertainty in simulated GY was large, with RMSRE ranging from 8% to 
73% among the 27 models, but 80% of the models had an RMSRE for GY comprised 
between 14% and 47% (Fig. 3a). For the other end-of-season variables RMSRE ranged from 
7% to 60% for HI (averaging 24%), 22 to 61% for GN (averaging 38%), 15% to 52% for NHI 
(averaging 26%), and 8% to 122% for GPC (averaging 34%; Fig. 3a). For the in-season 
variables with multiple measurements per site, the RMSRE ranged from 48% to 1496% for 
LAI, 37% to 355% for PASW, 41% to 542% for AGBM, 49% to 472% for AGN, and 16% to 
104% for NNI (Fig. 3a). 

Of the three models with the smallest RMSE for GY, only the second-ranked model had 
RMSE values below the average of all models for all variables considered (Table 1). The 
other two models had an RMSE substantially higher than the average for at least one variable. 
The first- and second-ranked models simulated GY closely because of compensating errors. 
They underestimated LAI around anthesis and final AGBM which was compensated for by 
overestimating HI. For instance, the first-ranked model simulated that the canopy intercepted 
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1 

83%, 74% and 51% of the incident radiation around anthesis in AR, IN and NL, respectively, 

2 

while according to measured LAI values the percentage of radiation interception was close to 

3 

93% at the three sites (assuming an extinction coefficient of 0.55, an average value reported 

4 

for wheat canopies (Sylvester-Bradley el al., 2012)). This model compensated by having 

5 

unrealistically high HI values that were 19% to 93% higher than measured HI. Theoretical 

6 

maximum HI has been estimated at 62-64% for wheat (Foulkes et al ., 201 1), while this model 

7 

had simulated values up to 69% (in NL). The third-ranked model showed no significant 

8 

compensation of errors. This model overestimated LAI around anthesis by 16% in AR and 

9 

NL, but this translated into only a small effect on intercepted radiation, since the canopy 

10 

intercepted more than 90% of incident radiation based on observed LAI. 

11 

Relation between the error for grain yield and that for underlying variables 

12 

There was little relation between the errors for different variables (Fig. 3 a, b). There were 

13 

some exceptions however. Notably, RMSE for AGBM was highly correlated with that for 

14 

GY, and that for AGN was correlated with GN (Fig. 4). Similarly, RMSE for AGN was 

15 

highly correlated with that for LAI, PASW, and NNI. Finally, RMSE for NNI was correlated 

16 

with that for PASW, HI, and GN and to a lesser extent with that for NNI. RMSE for GPC was 

17 

not significantly correlated with any other variable. Overall, the correlations between RMSRE 

18 

for different variables were similar to that between RMSE for different variables (Fig. S2 in 

19 

supplemental). 

20 

Multimodel ensemble estimators 

21 

Two multimodel ensemble estimators were tested. The first, the e-mean, uses the mean of the 

22 

simulations of the ensemble members, a common practice in climate ensemble modeling 

23 

(Knutti, 2010). The second, the e-median, uses the median of the simulations of the ensemble 

24 

members. The e-median is expected to be less sensitive to outlier simulations than e-mean and 

25 

therefore provide more robust estimates. 
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The e-median and e-mean values gave good agreement with measured values in almost 
all cases, despite the fact that the simulations of the individual models varied considerably 
(Fig. 1, 2). The e-median and e-mean models were much better than the average over models 
for all responses (Fig. 3). For most variables, e-mean and e-median had similar RMSE and 
RMSRE values, and their ranking among all models was close (Table 1 and Supplementary 
Table S3, S4). The largest difference in ranks was for RMSE for GPC, where e-median was 
ranked 3 and e-mean was ranked 7. 

For most variables, e-mean and e-median were comparable to the best single model for 
that variable (Fig. 3a, b). When e-median was ranked with the other models based on 
RMSRE, it ranked fourth for GY and third for GPC (Table S3 in supplemental); and first for 
GY and third for GPC when ranked based on RMSE (Table S4 in supplemental). One way to 
quantify the overall skill of e-mean and e-median is to consider the sum of ranks over all the 
variables. The sum of ranks based on RMSE for the 1 0 variables analyzed in this study was 
37 for e-median and 45 for e-mean, while the lowest sum of ranks for an individual model 
(among the 17 models that simulated all variables) was 53 (Table S3 in supplemental). If we 
only considered the four variables simulated by all 27 models (i.e. LAI, AGBM, GY, and HI), 
the sum of ranks for e-median and e-mean was 15 and 17, respectively, while the best sum of 
ranks for an individual model with these four variables was 28. 

In order to analyze the relationship between the number of models in an ensemble and the 
RMSE of both e-mean and e-median, we used a bootstrap approach to create a large number 
of ensembles of different sizes. The RMSE of both e-mean and e-median in each bootstrap 
ensemble was calculated and averaged over bootstrap samples. The bootstrap results for e- 
mean were very close to the theoretical expectation of RMSE (Fig. 5). For all variables, the 
standard deviation of RMSE between bootstrap samples for e-mean decreased as the number 
of models in the ensemble increased. The average RMSE of e-median also decreased with the 
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1 number of models, in a manner similar to, but not identical to, the average e-mean RMSE. 

2 The differences were most pronounced for GPC (Fig. 5j). 
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Discussion 

Working with multimodel ensembles is well-established in climate modeling, but only 
recently has the necessary international coordination been developed to make this also 
possible for crop models (Rosenzweig et al., 2013b). Here we examined the performance of 
an ensemble of 27 wheat models, created in the context of the AgMIP Wheat Pilot study 
(Asseng et al., 2013). Multiple crop responses, including both end-of-season and in-season 
growth variables were considered. Among these, GY and GPC are the main determinants of 
wheat productivity and end-use value. The other variables helped indicate whether models are 
realistic and consistent in their description of the processes leading to GY and GPC. This 
provides more comprehensive information on crop system properties beyond GY and is 
essential for the analysis of adaptation and mitigation strategies to global changes (Challinor 
et al., 2014). 

In only a few cases there were significant correlations between a model’s error for one 
variable and its error for other variables. Several individual models had relatively small errors 
for GY or GPC and large errors for in-season variables, including two of the three models 
with the lowest RMSE for GY. These models arrived at accurate simulations of GY or GPC 
without simulating crop growth accurately and thus got the right answer for, at least in part, 
the wrong reasons. That is, models can compensate for structural inconsistency. It has been 
argued that interactions among system components are largely empirical in most crop models 
(Ahuja & Ma, 2011) and that model error is minimized with different parameter values for 
different variables (Wallach, 2011), which would explain why a model might simulate one 
variable well and not others. However, it remains unclear whether such compensation will be 
effective in a wide range of environments. The lack of correlation between model errors for 
different variables illustrate the need for crop model ensemble assessment for multiple 
variables (Challinor et al., 2014), as done in this study. 
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1 

The behavior of the median and mean of the ensemble simulations was similar. Both 

2 

estimators had much smaller errors and better skills than the average over models, for all 

3 

variables. In comparing the sum of ranks of error for all variables, which provides an 

4 

aggregated performance measure, the e-median was better than e-mean, but most importantly 

5 

both were superior to even the best performing model in the ensemble. Different measures of 

6 

performance might give slightly different results, but would not change the fact that e-median 

7 

and e-mean compare well with even the best models. 

8 

E-mean and e-median had small errors in simulating not only end-of-season variables but 

9 

also in-season variables. This suggests that multimodel ensembles could be useful not only for 

10 

simulating GY and GPC, but also for relating those results to in-season growth processes. 

11 

This is important if crop model ensembles are to be useful in exploring the consequences of 

12 

global change and the benefits of adaptation or mitigation strategies. 

13 

A fundamental question is the origin of the advantage of ensemble predictors over 

14 

individual models. Two possible explanations relate to compensation among errors in 

15 

processes descriptions and to more coverage of the possible crop and soil phase spaces. 

16 

Certain models had large errors with compensations to achieve a reasonable yield 

17 

simulation. In those cases, e-median can supply a better estimate when multiple responses are 

18 

considered, since it gives reasonable results for all variables. In other cases, it is simply the 

19 

fact that the errors in the different models tend to compensate each other well, that makes e- 

20 

median the best estimator over multiple responses. The compensation of errors among models 

21 

comes, at least in part, from the fact that models do not produce random outputs but are 

22 

driven by environmental and management inputs and bio-physical processes and therefore 

23 

they tend to converge to the measured crop response. It is an open question however as to 

24 

whether the superiority of crop model ensemble estimators compared to individual models 
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extends to conditions not tested in this study. In particular, will this still be the case if the 
models are used to predict the impact of climate change? 

For climate models, the main reason for the superiority of multimodel ensemble 
estimators is that better coverage of the whole possible climate phase space leads to greater 
consistency (Hagedorn et al., 2005). An analogous advantage holds as well for crop model 
ensembles, they have more associated knowledge and represent more processes than any 
individual model. Each of the individual models has been developed and calibrated based on a 
limited data set. The ensemble simulators are in a sense averaging over these data sets, which 
gives them the advantage of a much broader data base than any individual model and thus 
reduces the need for site- and varietal-specific model calibration. 

The use of ensemble estimators to answer new questions in the future poses specific 
questions regarding the best procedure for creating an ensemble. Several of these questions 
have been debated in the climate science community (Knutti, 2010), but not always in a way 
that is directly applicable to crop models. One question is how performance varies with the 
number of models in the ensemble. Flere we found that the change in ensemble error ( MSE m .) 

with the number of model in the ensemble (M ) follows the expectation of MSE. Thus when 
planning ensemble studies, one can estimate the potential reduction in MSE m , and therefore, 

do a costs vs. benefits analysis for increasing M . In the ensemble studied here, for all the 
variables, MSE for an ensemble of 10 models was close to the asymptotic limit for very large 

M . 

Other questions include how to choose the models in the ensemble, and whether one 
should weight the models in the ensemble differently, based on past performance and 
convergence for new situations (Tebaldi & Knutti, 2007). In this respect the crop modeling 
community might employ some of the ensemble weighting methods developed by the climate 
modeling community (Christensen et al., 2010). There are also questions about the possible 
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1 

multiple uses of models. Would it be advantageous to have multiple simulations, based on a 

2 

diversity of initial conditions (including ‘spin-up’ periods for models that depend on 

3 

simulation of changes in soil organic matter) or multiple parameter sets from each model? In 

4 

any case, the first step is to document the accuracy of multimodel ensemble estimators in 

5 

specific situations, as done here. 

6 

In summary, by reducing simulation error and improving the consistency of simulation 

7 

results for multiple variables, crop model ensembles could substantially increase the range of 

8 

questions that could be addressed. A lack of correlation between end-of-season and in-season 

9 

errors in the individual models indicates that further work is needed to improve the 

10 

representation of the dynamics of growth and development processes leading to GY in crop 

11 

models. This is crucial for their application under changed climatic or management 

12 

conditions. 

13 

Most of the physical and physiological processes that are simulated in wheat models are 

14 

the same as for other crops. In fact, several of the models in this study have a generic structure 

15 

so that they can be applied to various crops, and for some of them the differences between 

16 

crops are simply in the parameter values. It is thus reasonable to expect that the results 

17 

obtained here for wheat are broadly applicable to other crop species. It would be worthwhile 

18 

to study whether these results also apply more generally to biological and ecological system 

19 

models. 
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1 Supporting Information 

2 Additional Supporting Information may be found in the online version of the article: 

3 Table SI. Details of the experimental sites and experiments provided to the modelers. 

4 Table S2. Name, reference and source of the 27 wheat crop models used in this study. 

5 Table S3. Root mean square relative error (RMSRE) for in-season and end-of-season 

6 variables. 

7 Table S4. Root mean square error (RMSE) for in-season and end-of-season variables. 

8 Figure SI. Weather data at the four studied sites. 

9 Figure S2. Correlation matrix for Pearson’s product-moment correlation (r) between the root 
10 mean squared relative error of simulated variables. 
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1 

Figure Captions 

2 

Fig. 1. Measured and simulated values of five in-season wheat crop variables for four 

3 

sites, (a-d) Leaf area index (LAI), (e-h) plant-available soil water (PASW), (i-1) total 

4 

aboveground biomass (AGBM), (m-p) total aboveground nitrogen (AGN), and (q-t) nitrogen 

5 

nutrition index (NNI) versus days after sowing in The Netherlands (NL), Argentina (AR), 

6 

India (IN) and Australia (AU). Symbols are single measurements and solid lines are medians 

7 

of the simulations (i.e. e-median). Dark grey areas indicate the 10th to 90th percentile range 

8 

and light grey areas the 25th to 75th percentile range of the values generated by different 

9 

wheat crop models. Twenty-seven models were used to simulate LAI and AGBM, 24 to 

10 

simulate PASW, 20 to simulate AGN and NNI. In e-h the horizontal red lines indicate 50% 

11 

soil water deficit. 

12 

Fig. 2. Measured and simulated values of two major end-of-season wheat crop variables 

13 

for four sites. Measured (red crosses) and simulated (box plots) values for end-of-season (a) 

14 

grain yield (GY) and (b) grain protein concentration (GPC) are shown for The Netherlands 

15 

(NL), Argentina (AR), India (IN) and Australia (AU). Simulations are from 27 different 

16 

wheat crop models for GY and 20 for GPC. Boxes show the 25th to 75th percentile range, 

17 

horizontal lines in boxes show medians, and error bars outside boxes show the 10th to 90th 

18 

percentile range. 

19 

Fig. 3. Wheat crop model errors for in-season and end-of-season variables, (a) Root mean 

20 

squared relative error (RMSRE) and (b) root mean squared error (RMSE) for in-season leaf 

21 

area index (LAI), plant-available soil water (PASW), total aboveground biomass (AGBM), 

22 

total above ground nitrogen (AGN), nitrogen nutrition index (NNI), and for end-of-season 

23 

grain yield (GY), biomass harvest index (HI), grain nitrogen yield (GN), nitrogen harvest 

24 

index (NHI), and grain protein concentration (GPC). Twenty-seven models were used to 

25 

simulate LAI, AGBM, GY, and HI, 20 to simulate AGN, GN, GPC and NNI, 24 to simulate 
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PASW, and 19 to simulate NHI. In a for GY the models are sorted from left to right in the 
order of increasing RMSE and this order of models was used to plot all other variables. The 
horizontal solid blue line shows RMSE or RMRSE averaged over all models and the 
horizontal red line shows RMSE or RMRSE for the median simulation of all models (e- 
median). 

Fig. 4. Correlation matrix for Pearson’s product-moment correlation ( r ) between the 
root mean squared error of simulated variables. In-season variables: leaf area index (LAI), 
plant-available soil water (PASW), total aboveground biomass (AGBM), total above ground 
nitrogen (AGN), nitrogen nutrition index (NNI). End-of-season variables: grain yield (GY), 
biomass harvest index (HI), grain nitrogen yield (GN), nitrogen harvest index (NHI), and 
grain protein concentration (GPC). Twenty-seven models were used to simulate LAI, AGBM, 
GY, and HI, 20 to simulate AGN, GN, GPC and NNI, 24 to simulate PASW, and 19 to 
simulate NHI. The numbers above the diagonal gap are r values and the numbers below are 
one-sided ^-values (adjusted P- values for false discovery rate). The color (for r values only) 
and the shape of the ellipses indicate the strength (the narrower the ellipse the higher the r 
value) and the direction of the correlation, respectively. 

Fig. 5. How the number of models in an ensemble affects error estimates. Average root 
mean squared error (RMSE) (± 1 s.d.) of e-mean and e-median for in-season (a) leaf area 
index (LAI), (c) plant-available soil water (PASW), (e) total above ground biomass (AGBM), 
(g) total above ground nitrogen (AGN) and (i) nitrogen nutrition index (NNI) and for end-of- 
season (b) grain yield (GY), (d) biomass harvest index (HI), (f) grain nitrogen yield (GN), (h) 
nitrogen harvest index (NHI), and (j) grain protein concentration (GPC) versus number of 
models in the ensemble. Values are calculated based on 10,000 bootstrap samples. The solid 
line is the analytical result for RMSE as a function of sample size (equation (8)). The blue 
dashed line shows the RMSE for e-mean and the red dashed line the RMSE for e-median of 
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1 the multimodel ensemble. The black dashed line is the RMSE for the individual model with 

2 lowest sum of ranks for RMSE. For visual clarity the RMSE for e-mean is plotted for even 

3 numbers of models, and the RMSE for e-median for odd numbers of models. 
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Table 1 RMSE for in-season and end-of-season variables. Ensemble averages and e-mean and e-median values are based on 27 different models for 
LAI, AGBM, GY, and HI, 24 for PASW, 20 for AGN, GN, GPC and NNI, and 19 for NHI. Values for the three best models for GY (based on 
RMSE) simulation are also given. Data for each individual model are given in Table S4 in supplemental. The numbers in parenthesis indicate the 
rank of the models (including e-mean and e-median) where 1 indicates the model with the lowest RMSE (i.e. best rank) for that variable. For each 
variable the model with the lowest RMSE is in bold type. 


RMSE for in-season variables 


RMSE for end-of-season variables 


Estimator 

LAI 

(m 2 m" 2 ) 

PASW 

(mm) 

AGBM 
(t DM ha 1 ) 

AGN 

(kg N ha ') 

NNI 

(-) 

GY 

(t DM ha 1 ) 

HI 

(%) 

GN 

(kg N ha 1 ) 

NHI 

(%) 

GPC 

(% of grain 
DM) 

Average over all models 

1.90 

47 

2.07 

39 

0.35 

1.25 

8.5 

38 

18.7 

3.93 

Model ranked 1 for GY 

2.31 (23) 

60 (21) 

2.26(17) 

89 (21) 

0.92 (22) 

0.42 (2) 

20.0 (28) 

100 (22) 

23.6(18) 

6.91 (21) 

Model ranked 2 for GY 

1.24 (7) 

36(9) 

1.71 (13) 

24(8) 

0.26 (8) 

0.56 (4) 

7.2(16) 

27 (9) 

9.1 (2) 

2.75 (9) 

Model ranked 3 for GY 

1.75 (16) 

63 (22) 

1.01 (3) 

22(7) 

0.21 (4) 

0.63 (5) 

3.8 (5) 

29(10) 

11.7(5) 

2.13 (6) 

e-median 

1.20 (6) 

27 (3) 

1.20 (6) 

15(3) 

0.25 (7) 

0.41 (1) 

2.8 (2) 

22 (5) 

8.8 (1) 

1.57 (3) 

e-mean 

1.29 (8) 

27(5) 

1.19(5) 

13(1) 

0.24 (6) 

0.49 (3) 

2.2 (1) 

23 (6) 

9.8 (3) 

2.32 (7) 
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Table SI. Details of the experimental sites and experiments provided to the modelers. Adapted from Asseng et al. (2013). 



Site 





NL 

AR 

IN 

AU 

Site description 

Environment 





High-yielding long-season 

High/medium- yielding 

Irrigated short-season 

Low-yielding rain-fed short- 



medium-season 


season 

Regional representation 

Western and northern Europe 

Argentina, northern China, 

India, Pakistan, southern 

Australia, southern Europe, 



western USA 

China 

northern Africa, South 
Africa, Middle East 

Location name 

Wageningen (’The Bouwing’) 

Balcarce 

New Delhi 

Wongan Hills 


The Netherlands 

Argentina 

India 

Australia 

Coordinates 

51° 58’ N, 05° 37’ E 

37° 45’ S, 58° 18’ W 

28° 22’ N, 77° T E 

30° 53’ S, 116° 43’ E 

Soil characteristics 





Soil type a 

Silty clay loam 

Clay loam 

Sandy loam 

Loamy sand 

Rooting depth (cm) 

200 

130 

160 

210 

Apparent bulk density (m 3 m‘ 3 ) 

1.35 

1.1 

1.55 

1.41 

Top soil organic matter (%) 

2.52 

2.55 

0.37 

0.51 

pH 

6.0 

6.3 

8.3 

5.7 

Maximum plant available soil water (mm to 
maximum rooting depth) 

354 

222 

109 

125 

Crop management 





Sowing density (seed m‘ 2 ) 
Cultivar 

228 

239 

250 

157 

Name 

Arminda 

Oassis 

HD2009 

Gamenya 

Vernalization requirement 

High 

Little 

None 

Little 

Daylength response 

High 

Moderate 

None 

Moderate 

Ploughed crop residue 

Potato (4 t ha 1 ) 

Maize (7 t ha" 1 ) 

Maize (1.5 t ha" 1 ) 

Wheat/weeds (1.5 t ha" 1 ) 

Irrigation (mm) 

0 

0 

383 

0 

N application (kg N ha" 1 ) 

120 (ZC30 b ) / 40 (ZC65) 

120 (ZC00) 

60 (ZC00) / 60 (ZC25) 

50 (ZC10) 

Initial top soil mineral N (kg N ha' 1 ) 

80 

13 

25 

5 

Sowing date 

21 Oct. 1982 

10 Aug. 1992 

23 Nov. 1984 

12 Jun. 1984 

Anthesis date 

20 Jun. 1983 

23 Nov. 1992 

18 Feb. 1985 

1 Oct. 1984 

Physiological maturity date 

1 Aug. 1983 

28 Dec. 1992 

3 Apr. 1985 

16 Nov. 1984 


a Saturated soil water content, drainage upper limit and lower limit to water extraction were provided for 10 to 30-cm thick soil layers down to the maximum rooting depth. 
b ZC, Zadoks stage(Zadoks et al.. 1974) at application is indicated in parenthesis (ZC00, sowing; ZC10, first leaf through coleoptile; ZC25, main shoot and five tillers; ZC30, 
pseudo stem erection; ZC65, anthesis half-way. 
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Table S2. Name, reference and source of the 27 wheat crop models used in this study. Modified from Asseng et al. (2013). 


Model (version) 


Reference to model description 


APSIM-Nwheats (V.1.55) 

APSIM (V.7.3) 

AquaCrop (V.3.1+) 

CropSyst (V.3. 04.08) 

DSSAT-CERES (V.4.0.1.0) 
DSSAT-CROPS1M (V.4.5. 1.013) 
Ecosys 

EPIC wheat (V. 11 02) 

Expert-N (V3.0.10) - CERES (V2.0) 
Expert-N (V3.0.10) - GECROS (V1.0) 

Expert-N (V3.0.10) - SPASS (V2.0) 


(Asseng et al., 2004, Asseng et al, 1998, Keating et al., 2003) 

(Keating et al., 2003) 

(Steduto et al., 2009) 

(Stockle et al., 2003) 

(Hoogenboom & White, 2003, Jones et al., 2003, Ritchie & Otter, 1985) 

(Hunt & Pararajasingham, 1995, Jones et al., 2003) 

(Grant et al., 2011) 

(Izaurralde et al., 2012, Kiniry et al., 1995, Williams et al., 1989) 

(Biernath et al., 2011, Priesack et al., 2006, Stenger et al., 1999) 

(Biernath et al., 2011, Priesack et al., 2006, Stenger et al., 1999, Yin & van Laar, 
2005) 

(Biernath et al., 2011, Priesack et al., 2006, Stenger et al., 1999, Wang & Engel, 

2000) 


Documentation/source (web link, e-mail address) 

http://www.apsim.info/Wiki/ 

http://www.apsim.info/Wiki/ 

http://www.fao.org/nr/water/aquacrop.html 

http://www.bsyse.wsu.edu/CS_Suite/CropSyst/index.html 

http://www.icasa.net/dssat/ 

http://www.icasa.net/dssat/ 

http://www.rr.ualberta.ca/en/Research/EcosysModellingProject.aspx 
http ://epicapex. brc . tamus.edu/ 

http://www.helmholtz-muenchen.de/en/iboe/expertn/ 

http://www.helmholtz-muenchen.de/en/iboe/expertn/ 

http://www.helmholtz-muenchen.de/en/iboe/expertn/ 


Expert-N (V3.0.10) - SUCROS (V2) 

FASSET (V.2.0) 

GLAM-wheat (V.2) 

HERMES (V.4.26) 

InfoCrop (V.l) 

LINTUL-4 (V.l) 

LINTUL-FAST (V.1.0) 

LPJmL (V.3.2) 

MCWLA- Wheat (V.2.0) 

MONICA (V.1.0) 

O'Leary-model (V.7) 

SALUS (V.1.0) 

Sirius (V.2010) 

SiriusQuality (V.2.0) 

STICS (V.1.1) 

WOFOST (V.7.1) 


(Biernath et al., 2011, Goudriaan & Van Laar, 1994, Priesack et al, 2006, Stenger et 
al, 1999) 

(Berntsen et al., 2003, Olesen et al., 2002) 

(Challinor et al. , 2004, Li et al. , 20 1 0) 

(Kersebaum, 2007, Kersebaum, 2011) 

(Aggarwal et al. , 2006) 

(Shibu et al., 2010) 

(Angulo et al., 2013) 

(Bondeau et al., 2007) 

(Tao et al, 2009) 

(Nendel et al., 2011) 

(O'Leary & Connor, 1996a, O'Leary & Connor, 1996b) 

(Basso et al., 2010, Senthilkumar et al., 2009) 

(Jamieson et al., 2000, Jamieson et al., 1998, Lawless etal., 2005) 

(Ferrise et al., 2010, He et al, 2012, Martre et al, 2006) 

(Brisson et al., 2003, Brisson et al, 2009, Brisson et al., 1998, Brisson et al., 2002) 
(Boogaard etal., 1998, Van Diepen etal., 1989) 


http://www.helmholtz-muenchen.de/en/iboe/expertn/ 

http://www.fasset.dk 

http://www.see.leeds.ac.uk/see- 

research/icas/climate_change/glam/glam.html 

http://www.zalf.de/en/forschung/institute/lsa/forschung/oekomod/hermes 

Request from nareshkumar.soora@gmail.com 

http://models.pps.wur.nl/models 

Request from frank.ewert@uni-bonn.de 

http://www.pik-potsdam.de/research/projects/lpjweb 

Request from taofl@igsmT.ac.cn 

http://monica.agrosystem-models.com 

Request from author (gjoleary@yahoo.com) 

http://www.salusmodel.net 

http://www.rothamsted.ac.uk/mas-models/sirius.html 
http://www 1 .clermont.inra.fr/siriusquality/ 
http://www7.avignon.inra.fr/agroclim_stics 
http://www.wofost.wur.nl 
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Table S3. Root mean square relative error (RMSRE) for in-season and end-of-season variables. 


RMSRE (%) < H 


Model* In-season End-of-season Sum of 



LAI 

PASW 

AGBM 

AGN 

NNI 

GY 

HI 

GN 

NHI 

GPC 


1 

199 

102 

159 

472 

104 

8.1 (1) 

57.3 (28) 

61.1 (22) 

31.1 (15) 

57.7 (19) 

85/29 

2 

398 

129 

89 

76 

33 

17.4 (9) 

19.7(16) 

36.6 (12) 

14.6 (3) 

25.5 (12) 

52/25 

3 

246 

142 

41 

67 

29 

15.6 (6) 

9.8 (4) 

35.1 (11) 

18.8(6) 

23.9 (10) 

37/10 

4 

716 

37 

164 

NA 

NA 

21.1 (12) 

17.4 (15) 

NA 

NA 

NA 

-111 

5 

319 

177 

129 

NA 

NA 

13.3 (2) 

24.3 (20) 

NA 

NA 

NA 

-111 

6 

171 

NA 

47 

NA 

NA 

19.3 (11) 

20.3 (17) 

NA 

NA 

NA 

— /28 

7 

1496 

50 

132 

60 

28 

23.5 (15) 

15.3 (11) 

23.2 (6) 

18.5 (4) 

58.3 (20) 

56/26 

8 

172 

95 

114 

123 

38 

13.7 (3) 

11.3 (6) 

29.5 (7) 

19.4 (9) 

36.9(15) 

40/9 

9 

140 

37 

67 

63 

16 

14.4 (5) 

13.3 (8) 

22.2 (2) 

22.8(11) 

10.7 (2) 

28/13 

10 

821 

68 

542 

384 

35 

16.4 (8) 

14.3 (9) 

44.1 (17) 

28(14) 

26.8 (13) 

61/17 

11 

692 

59 

52 

49 

56 

27.8 (17) 

23.5 (19) 

39.3 (15) 

48 (20) 

28.8 (14) 

85/36 

12 

133 

45 

103 

145 

48 

18.2(10) 

24.5 (21) 

NA 

NA 

NA 

— /3 1 

13 

745 

355 

296 

74 

87 

38.2 (22) 

25.2 (22) 

58 (21) 

18.5 (5) 

17.4 (5) 

75/44 

14 

1150 

150 

53 

72 

32 

42.5 (23) 

16.6(13) 

31.6 (9) 

19.2(8) 

121.8(22) 

75/36 

15 

58 

40 

84 

75 

34 

22.8 (14) 

7(2) 

37.9 (14) 

40.3 (19) 

23.2 (7) 

56/16 

16 

219 

NA 

196 

116 

42 

49.6 (28) 

49.5 (26) 

55.9(19) 

52.3 (21) 

23.3 (8) 

102/54 

17 

699 

97 

41 

55 

36 

22.8 (13) 

16.7 (14) 

22.6 (4) 

19.1 (7) 

8(1) 

39/27 

18 

749 

65 

126 

82 

29 

43.8 (25) 

9.8 (4) 

47.1 (18) 

32.1 (16) 

38.3(17) 

80/29 

19 

156 

101 

187 

52 

41 

30.9 (20) 

59.9 (29) 

34.5 (10) 

27.1 (13) 

39.9(18) 

90/49 

20 

109 

45 

356 

230 

37 

33.6 (21) 

26.7 (23) 

56.6 (20) 

34.6(18) 

23.7 (9) 

91/44 

21 

663 

94 

69 

76 

35 

28.9(18) 

28.9 (24) 

22.9 (5) 

21.3(10) 

37.6(16) 

73/42 

22 

773 

NA 

193 

192 

49 

29.9(19) 

11.8 (7) 

30 (8) 

23.8(12) 

15.3 (4) 

50/26 

23 

294 

40 

199 

NA 

NA 

45 (26) 

44.6 (25) 

NA 

NA 

NA 

— /5 1 

24 

1085 

79 

77 

73 

61 

27(16) 

22.3 (18) 

37.6 (13) 

33.6(17) 

24.8(11) 

75/34 

25 

48 

59 

91 

NA 

NA 

43 (24) 

15.7(12) 

40.9 (16) 

NA 

64(21) 

— /36 

26 

75 

59 

231 

NA 

NA 

48.6 (27) 

15.3(10) 

NA 

NA 

NA 

— /37 

27 

1199 

NA 

306 

NA 

NA 

72.6 (29) 

53.8 (27) 

NA 

NA 

NA 

— /56 

e-median 

242 

64 

113 

66 

25 

14(4) 

7.1 (3) 

22.5 (3) 

13.7 (1) 

14.2 (3) 

14/7 

e- mean 

442 

70 

133 

79 

24 

15.6 (7) 

5.7 (1) 

19.5 (1) 

14.3 (2) 

20.8 (6) 

17/8 

Average over _ _ . 

92 

154 

127 

44 

29.2 

24.3 

38.3 

27.5 

35.3 

_ 


all models 


Results are based on 27 different wheat crop models for LAI, AGBM, GY and HI, 20 for AGN, GN, GPC and NNI, 24 for PASW, and 19 for NHI. 

The models are sorted from top to bottom in the order of increasing RMSE for GY. For each variable the model with the lowest RMSRE is in bold 
type. 

NA, variables not available for a model. For end-of-season variables, the numbers in parentheses indicate the rank of the models (including e-mean 
and e-median) for each variable. Ranks were not calculated for in-season variables because several of the in-season measurements were very small 
causing large relative errors even the absolute errors were reasonable. Therefore RMSRE for in-season variables should be looked at with caution. 

* Sum of rank of RMSRE for end-of-season variables/sum of rank of RMSRE for the variables simulated by all 27 models (i.e., LAI, AGBM, GY, 
HI). For the reason mentioned above the sum of rank did not include in-season variables. 
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Table S4. Root mean square error (RMSE) for in-season and end-of-season variables. 
RMSE 1 


In-season End-of-season Sum of 

Model ran k* 



LAI 

PASW 

AGBM 

AGN 

NNI 

GY 

HI 

GN 

NHI 

GPC 



(nr ni 2 ) 

(mm) 

(t DM ha" 

‘) (kg N ha" 1 ) 

(-) 

(t DM ha" 1 ) 

(%) 

(kg N ha" 1 ) (%) 

(% of grain DM) 


1 

2.31 (23) 

60 (21) 

2.26 (17) 

89 (21) 

0.92 (22) 

0.42 (2) 

20.0 (28) 

100 (22) 

23.6(18) 

6.91 (21) 

195/70 

2 

1.24(7) 

36 (9) 

1.71 (13) 

24(8) 

0.26 (8) 

0.56 (4) 

7.2(16) 

27 (9) 

9.1 (2) 

2.75 (9) 

85/40 

3 

1.75 (16) 

63 (22) 

1.01 (3) 

22 (7) 

0.21 (4) 

0.63 (5) 

3.8 (5) 

29(10) 

11.7 (5) 

2.13 (6) 

83/29 

4 

1.82(19) 

36 (8) 

1.64(12) 

NA 

NA 

0.66 (6) 

6.3 (13) 

NA 

NA 

NA 

— /50 

5 

1.13 (5) 

46(18) 

2.30(18) 

NA 

NA 

0.69 (7) 

9.9 (24) 

NA 

NA 

NA 

— /54 

6 

1.81 (18) 

NA 

1.41 (7) 

NA 

NA 

0.74 (8) 

7.6(17) 

NA 

NA 

NA 

— /50 

7 

3.34 (28) 

42 (16) 

1.44 (9) 

17(4) 

0.29(11) 

0.77 (9) 

6.2 (12) 

21(3) 

11.5 (4) 

6.39 (20) 

116/58 

8 

1.33 (10) 

26 (2) 

0.97 (2) 

30(10) 

0.28 (9) 

0.78 (10) 

4.0 (6) 

20 (2) 

13.6(9) 

4.04(16) 

76/28 

9 

1.30(9) 

32 (7) 

0.87(1) 

14(2) 

0.16 (1) 

0.81 (11) 

4.6 (9) 

20(1) 

14.5(10) 

1.19(2) 

53/30 

10 

1.93 (21) 

50 (20) 

2.58 (23) 

55 (19) 

0.30(12) 

0.88 (12) 

4.6 (8) 

39(15) 

19.3(14) 

2.85 (10) 

154/64 

11 

2.78 (26) 

37 (14) 

3.16(28) 

61 (20) 

0.36(16) 

1.06(13) 

9.1 (22) 

49(18) 

34.2(21) 

3.65 (15) 

193/89 

12 

1.12 (4) 

37 (12) 

2.15 (15) 

32(13) 

0.30(13) 

1.21 (14) 

8.1 (18) 

NA 

NA 

NA 

-751 

13 

4.50 (29) 

77 (23) 

1.90(14) 

92 (22) 

0.79 (21) 

1.24(15) 

8.5 (21) 

31 (13) 

13.5 (7) 

2.01 (5) 

170/79 

14 

1 .90 (20) 

37(13) 

2.60 (24) 

21 (6) 

0.20 (3) 

1.25 (16) 

6.9(15) 

26(8) 

12.1 (6) 

13.2 (22) 

133/75 

15 

1.12(3) 

30 (6) 

1.62(10) 

30(11) 

0.20 (2) 

1.26(17) 

2.9 (3) 

60 (21) 

29.2(19) 

3.42(13) 

105/33 

16 

0.91 (1) 

NA 

1.43 (8) 

39(15) 

0.43 (19) 

1.34(18) 

15.5 (26) 

51(19) 

33.4 (20) 

3.47 (14) 

-753 

17 

2.99 (27) 

45 (17) 

1.07(4) 

51 (18) 

0.33(15) 

1.34(19) 

6.8(14) 

22 (4) 

13.6 (8) 

1.04 (1) 

127/64 

18 

1.45(11) 

37(11) 

2.31 (19) 

18(5) 

0.32(14) 

1.35 (20) 

3.7 (4) 

30(12) 

20.3(15) 

3.36(12) 

123/54 

19 

1.63(14) 

27(4) 

2.46(21) 

34 (14) 

0.45 (20) 

1.36 (21) 

18.8 (27) 

32(14) 

17.5(12) 

4.35 (17) 

164/83 

20 

1.53(13) 

41 (15) 

2.18(16) 

50(17) 

0.29(10) 

1.43 (22) 

8.4 (20) 

52 (20) 

21.8(16) 

2.70 (8) 

157/71 

21 

2.23 (22) 

25 (1) 

2.62 (25) 

28 (9) 

0.21 (5) 

1.56 (23) 

9.3 (23) 

29(11) 

15.8(11) 

4.55 (18) 

148/93 

22 

1.75 (17) 

NA 

2.73 (26) 

32(12) 

0.36(17) 

1.59 (24) 

4.1 (7) 

43 (17) 

18.0(13) 

1.64(4) 

-774 

23 

1.67(15) 

47 (19) 

2.47 (22) 

NA 

NA 

1.61 (25) 

14.3 (25) 

NA 

NA 

NA 

-787 

24 

2.69 (25) 

36 (10) 

1.64(11) 

47 (16) 

0.40(18) 

1.68 (26) 

8.1 (19) 

25 (7) 

22.1 (17) 

3.17(11) 

160/81 

25 

1.04(2) 

100 (24) 

2.42 (20) 

NA 

NA 

1.80 (27) 

4.8(11) 

43 (16) 

NA 

5.73(19) 

-760 

26 

1.52(12) 

112(25) 

3.76 (29) 

NA 

NA 

2.17(28) 

4.8 (10) 

NA 

NA 

NA 

-779 

27 

2.37 (24) 

NA 

3.07 (27) 

NA 

NA 

3.63 (29) 

20.3 (29) 

NA 

NA 

NA 

— /109 

e-median 

1.20(6) 

27 (3) 

1.20 (6) 

15(3) 

0.25 (7) 

0.41 (1) 

2.8 (2) 

22 (5) 

8.8 (1) 

1.57 (3) 

37/15 

e- mean 

1 .29 (8) 

27 (5) 

1.19(5) 

13 (1) 

0.24 (6) 

0.49 (3) 

2.2 (1) 

23 (6) 

9.8 (3) 

2.32 (7) 

45/17 

Average over . __ 
all models 

47 

2.07 

39 

0.35 

1.25 

8.5 

38 

18.7 

3.93 

- 


Results are based on 27 different wheat crop models for LAI, AGBM, GY and HI, 20 for AGN, GN, GPC and NNI, 24 for PASW, and 19 for NHI. 

The models are sorted from top to bottom in the order of increasing RMSE for GY. For each variable the model with the lowest RMSE is in bold 
type. 

' NA, variables not available for a model. The numbers in parentheses indicate the rank of the models (including e-mean and e-median) for each 
variable. 

* Sum of rank of RMSE for all variables/sum of rank of RMSE for the variables simulated by all 27 models (i.e., LAI, AGBM, GY, HI). 
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Figure SI. Weather data at the four studied sites. Mean weekly temperature (solid lines), 
cumulative weekly solar radiation (dashed lines), cumulative weekly rainfall (vertical solid 
bars) and irrigation (vertical open bars) in (a) Wageningen, The Netherlands, (b) Balcarce, 
Argentina, (c) New Delhi, India, and (d) Wongan Hills, Australia. Vertical arrows indicate (a) 
anthesis and (m) physiological maturity dates. 
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Figure S2. Correlation matrix for Pearson’ s product- moment correlation (r) between the root 
mean squared relative error of simulated variables. In-season variables: leaf area index (LAI), 
plant-available soil water (PASW), total aboveground biomass (AGBM), total above ground 
nitrogen (AGN), nitrogen nutrition index (NNI). End-of-season variables: grain yield (GY), 
biomass harvest index (HI), grain nitrogen yield (GN), nitrogen harvest index (NHI), and 
grain protein concentration (GPC). Twenty-seven models were used to simulate LAI, AGBM, 
GY, and HI, 20 to simulate AGN, GN, GPC and NNI, 24 to simulate PASW, and 19 to 
simulate NHL The numbers above the diagonal gap are r values and the numbers below are 
one-sided ^-values (adjusted P- values for false discovery rate). The color (for r values only) 
and the shape of the ellipses indicate the strength (the narrower the ellipse the higher the r 
value) and the direction of the correlation, respectively. 
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